Bag of What? Simple Noun Phrase Extraction for Text Analysis
Abstract
Social scientists who do not have specialized natural language processing training often use a unigram bag-of-words (BOW) representation when analyzing text corpora. We offer a new phrase-based method, NPFST, for enriching a unigram BOW. NPFST uses a part-of-speech tagger and a finite state transducer to extract multiword phrases to be added to a unigram BOW. We compare NPFST to both n-gram and parsing methods in terms of yield, recall, and efficiency. We then demonstrate how to use NPFST for exploratory analyses; it performs well, without configuration, on many different kinds of English text. Finally, we present a case study using NPFST to analyze a new corpus of U.S. congressional bills. For our open-source implementation, see http://slanglab.cs.umass.edu/phrases/.
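The general idea of enriching a unigram BOW with phrases matched by a part-of-speech pattern can be illustrated with a minimal sketch. This is not the NPFST implementation: it assumes tokens are already tagged, uses hypothetical coarse tags ("A" for adjective, "N" for noun, "O" for other), substitutes a simple tag check for the finite state transducer, and applies the simplified pattern (A|N)* N. The names extract_phrases, enriched_bow, and max_len are invented for illustration.

```python
from collections import Counter

# Hypothetical coarse tag set (an assumption for this sketch, not NPFST's):
# "A" = adjective, "N" = noun, "O" = anything else.
NP_TAGS = ("A", "N")


def extract_phrases(tagged, max_len=4):
    """Return every multiword span whose tags match (A|N)* N.

    Overlapping and nested spans are all kept, so "health care" and
    "health care program" can both be returned for the same sentence.
    """
    phrases = []
    n = len(tagged)
    for start in range(n):
        # Consider spans of 2..max_len tokens beginning at `start`.
        for end in range(start + 2, min(start + max_len, n) + 1):
            span = tagged[start:end]
            tags = [tag for _, tag in span]
            if all(tag in NP_TAGS for tag in tags) and tags[-1] == "N":
                phrases.append(" ".join(word.lower() for word, _ in span))
    return phrases


def enriched_bow(tagged, max_len=4):
    """Count unigrams plus the extracted multiword phrases."""
    unigrams = [word.lower() for word, _ in tagged]
    return Counter(unigrams + extract_phrases(tagged, max_len))


# Hand-tagged example sentence (tags assigned by hand for illustration).
tagged = [
    ("The", "O"), ("federal", "A"), ("health", "N"),
    ("care", "N"), ("program", "N"), ("expanded", "O"),
]
bow = enriched_bow(tagged)
```

In practice the tags would come from a part-of-speech tagger, and the pattern would be richer than the one assumed here.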
Similar resources
Noun-Phrase Analysis in Unrestricted Text for Information Retrieval
Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe...
Accurate Keyphrase Extraction from Scientific Papers by Mining Linguistic Information
In this paper we investigate the impact of filtering candidate terms using linguistic information on the accuracy of automatic keyphrase extraction from scientific papers. According to linguistic knowledge, noun phrases are the most likely to be keyphrases. However, the definition of a noun phrase can vary from one system to another. We have identified five POS tag sequence definitions of a noun p...
An Endogeneous Corpus-Based Method for Structural Noun Phrase Disambiguation
In this paper, we describe a method for structural noun phrase disambiguation which mainly relies on the examination of the text corpus under analysis and doesn't need to integrate any domain-dependent lexico- or syntactico-semantic information. This method is implemented in the Terminology Extraction Software LEXTER. We first explain why the integration of LEXTER in the LEXTER-K project, which ai...
A Noun Phrase Parser of English
A noun phrase parser is useful for several purposes, e.g. for index term generation in an information retrieval application; for the extraction of collocational knowledge from large corpora for the development of computational tools for language analysis; for providing a shallow but accurately analysed input for a more ambitious parsing system; for the discovery of translation units, and so on....
Extracting Noun Phrases in Subject and Object Roles for Exploring Text Semantics
In tune with the recent developments in the automatic retrieval of text semantics, this paper is an attempt to extract one of the most fundamental semantic units from natural language text. The context is intuitively extracted from typed dependency structures basically depicting dependency relations instead of Part-Of-Speech tagged representation of the text. The dependency relations imply deep...